Natural Temporal Difference Learning
Abstract
In this paper we investigate the application of natural gradient descent to Bellman-error-based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. We present and analyze quadratic and linear time natural temporal difference learning algorithms, and prove that they are covariant. We conclude with experiments which suggest that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon their non-natural counterparts when using non-linear function approximation.

Introduction

Much recent research has focused on problems with continuous actions. For these problems, a significant leap in performance occurred when Kakade (2002) suggested the application of natural gradients (Amari 1998) to policy gradient algorithms. This suggestion has resulted in many successful natural-gradient-based policy search algorithms (Morimura, Uchibe, and Doya 2005; Peters and Schaal 2008; Bhatnagar et al. 2009; Degris, Pilarski, and Sutton 2012). Despite these successful applications of natural gradients to reinforcement learning in the context of policy search, they have not been applied to Bellman-error-based algorithms like residual gradient and Sarsa(λ), which are the de facto algorithms for problems with discrete action sets. A common complaint is that these Bellman-error-based algorithms learn slowly when using function approximation. Natural gradients are a quasi-Newton approach that is known to speed up gradient descent, and thus the synthesis of natural gradients with TD has the potential to mitigate this drawback of reinforcement learning. Additionally, we show in the appendix that the natural TD methods are covariant, which makes them more robust to the choice of representation than ordinary TD methods.

In this paper we provide a simple quadratic-time natural temporal difference learning algorithm, show how the idea of compatible function approximation can be leveraged to achieve linear time complexity, and prove that our algorithms are covariant. We conclude with empirical comparisons on three canonical domains (mountain car, cart-pole balancing, and acrobot) and one novel challenging domain (playing Tic-tac-toe using handwritten letters as input). When not otherwise specified, we assume the notation of Sutton and Barto (1998).

Residual Gradient

The residual gradient (RG) algorithm is the direct application of stochastic gradient descent to the problem of minimizing the mean squared Bellman error (MSBE) (Baird 1995). It is given by the following update equations:

\delta_t = r_t + \gamma Q_{\theta_t}(s_{t+1}, a_{t+1}) - Q_{\theta_t}(s_t, a_t), \quad (1)

\theta_{t+1} = \theta_t - \alpha_t \delta_t \frac{\partial \delta_t}{\partial \theta_t}. \quad (2)
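To make updates (1) and (2) concrete, below is a minimal sketch of one residual gradient step for a linear action-value function Q_θ(s, a) = θ⊤φ(s, a). The feature vectors and step size are supplied by the caller; the function name and signature are illustrative assumptions, not part of the paper.

```python
import numpy as np

def residual_gradient_step(theta, phi_sa, phi_next_sa, r, gamma, alpha):
    """One residual gradient (RG) update on the squared Bellman error,
    assuming a linear action-value function Q_theta(s, a) = theta @ phi(s, a).

    phi_sa      : feature vector phi(s_t, a_t)
    phi_next_sa : feature vector phi(s_{t+1}, a_{t+1})
    """
    # Bellman error, equation (1)
    delta = r + gamma * np.dot(theta, phi_next_sa) - np.dot(theta, phi_sa)
    # For a linear Q, the gradient of delta with respect to theta is
    # gamma * phi(s_{t+1}, a_{t+1}) - phi(s_t, a_t)
    grad_delta = gamma * phi_next_sa - phi_sa
    # Stochastic gradient descent step, equation (2)
    return theta - alpha * delta * grad_delta
```

A natural TD variant would additionally precondition this gradient step by the inverse of a metric on the parameter space, which is what makes the update covariant with respect to the value-function parameterization.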
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014